What is Data Science?

Summary

  1. Data Scientists need programming, mathematics, and database skills, many of which can be gained through self-learning.

  2. Companies recruiting for a Data Science team need to understand the variety of different roles Data Scientists can play, and look for soft skills like storytelling and relationship building as well as technical skills.

  3. High school students considering a career in Data Science should learn programming, math, and databases, and, most importantly, practice their skills.

Introduction to Data Science

Data Science is the field of exploring, manipulating, and analyzing data, and using data to answer questions or make recommendations.

As Data Science is not a discipline traditionally taught at universities, contemporary data scientists come from diverse backgrounds such as engineering, statistics, and physics.

The use cases for deep learning include speech recognition and classifying images at large scale.

According to Dr. White, someone joining a data science team first needs the following skills: basic probability and statistics, some algebra and calculus, an understanding of relational databases, the ability to program, and at least some computational thinking.

According to Dr. White, the industrial world is shifting to a new trend, and for high school students to be on the right side of this new trend, his advice to them is:

  • take a course in probability
  • learn how to program
  • learn some math
  • try to start experimenting with building small systems that work and are useful
  • learn statistics

For example, Netflix uses machine learning to recommend movies to you based on movies you have already watched and liked or disliked.

So the very first step is measurement. If companies have existing data, then they should start looking at it and cleaning it. If they don’t have existing data, then they need to start collecting it. >> I think they should look for a team who love to work as data scientists. >> The first step is to have employees who are interested in data science, because if you don’t have interest in your company, you will not have engagement. >> Companies should remember that it’s key to have a team. It’s not one data scientist, but a team of them, each with strengths in different areas of data science.

The Final Deliverable

The ultimate purpose of analytics is to communicate findings to stakeholders, who might use these insights to formulate policy or strategy. Analytics summarizes findings in tables and plots. The data scientist should then use the insights to build a narrative to communicate the findings. In academia, the final deliverable takes the form of essays and reports, usually 1,000 to 7,000 words in length. In consulting and business, the final deliverable takes on several forms. It can be a small document of fewer than 1,500 words illustrated with tables and plots, or it could be a comprehensive document comprising several hundred pages. Large consulting firms, such as McKinsey and Deloitte, routinely generate analytics-driven reports to communicate their findings and, in the process, establish their expertise in specific knowledge domains.

Recruiting for Data Science

Curiosity is one of the most important skills a data scientist should have, in addition to a sense of humor and storytelling.

When companies are hiring people for a data science team, maybe a data scientist or an analyst, or a chief data scientist, the tendency would be to find the person who has all the skills: they know the domain-specific knowledge, they’re excellent at analyzing structured and unstructured data, they’re great at presenting, and they’ve got great storytelling skills. So if you put all this together, you will realize you’re looking for a unicorn, and your odds of finding a unicorn are pretty low. I think what you need to do is to see, given the pool of applicants you have, who has the most resonance with your firm’s DNA. Because you can teach analytics skills; anyone can learn analytics skills if they dedicate time and effort to it. But what really matters is who’s passionate about the kind of business that you do. Someone could be a great data scientist in the retail environment, but they may not be that excited about working in IT-related firms or working with gigabytes of weblogs. But if someone is excited about those weblogs, or excited about health-related data, then they would be able to contribute to your productivity much more. And I would say, if I have to put together a data science team, I would first look for curiosity. Is that person curious about things, not just about data science but about anything: are they curious about why this room is painted a certain way, why the bookshelves have books, and what kinds of books? They have to have a certain degree of curiosity about everything in their vision. The second thing is, do they have a sense of humor? Because you have to be lighthearted about it. If someone is too serious, they would not be able to look at the lighter elements.
The third thing, and the last thing I would look for if I had to have a hierarchy, is technical skills. I would go through the social skills first: curiosity, a sense of humor, the ability to tell a story, the ability to know that there is a story there. And once all that is there, then I would say, well, can you do the technical side of it? And if there is some hope or some sign of technical skills, I would take them, because I can train them in whatever skills they need. But I cannot teach curiosity. I cannot teach storytelling. And I certainly cannot instill a sense of humor in anyone. >> I think there’s no hard and fast rule for hiring data scientists; it’s going to be a case-by-case thing. I would say there has to be some technical component: somebody should be able to work with and manipulate the data, and they should be able to communicate what they find in the data. I find quite often nobody really cares about the r-squared or the confidence interval, so you have to be able to introduce those things and explain something in a compelling way. And you also have to find somebody who is relatable, because data science being typically new means that the person in that role has to build relationships and work across different departments. >> These data scientists should have a good mathematics and statistics background. >> They have to consider problem-solving abilities and analysis; the scientist needs to be good at analyzing problems. >> The people they are hiring should love to play with data, know how to work with data visualization, and have analytical thinking. >> When a company is hiring anyone to work on a data science team, they need to think about what role that person is going to take. Before a company begins, they need to understand what they want out of their data science team, and then hire for it. As they grow a data science team, they need to understand whether they need engineers, architects, or designers to work on visualization, or whether they just need more people who can multiply large matrices.

>> From a skills point of view, let’s focus on the technical skills. In that case, the first question would be: what kind of technical platform would you like to adopt? Let’s say you want to work in a structured data environment, say in market research. Then the type of skills you need is slightly different than for someone who would like to work in big data environments. If you want to work with traditional market research data, in a structured data environment, your skills should include some statistical knowledge and some knowledge of basic statistical algorithms, maybe some machine learning algorithms. These are the tools you would like to develop. If you want to work in big data, then the other aspect is being able to store data. So you start with expertise in storing large amounts of data, and then you look into platforms that allow you to do that.
The next step would be to be able to manipulate large amounts of data, and the final step would be to apply algorithms to those large data sets. So it’s a three-step process. But most importantly, it starts with where you would like to be: in what field, in what domain. In terms of platforms, let’s say you want to be in the traditional predictive analytics environment and you’re not working with big data; then R, Stata, or Python would be your tools. If you’re working mostly with unstructured data, then Python is more suitable than R. If you’re working with big data, then Hadoop and Spark are the environments you will be working with. So it all depends upon where you would like to be and what kind of work excites you, and then you pick your tools. In addition to technical skills, the second aspect of data science is the ability to communicate: communication or presentation skills. I call them storytelling skills; that is, you have your analysis done, now can you tell a great story from it? If you have a very large table, can you synthesize it and make it more appealing, so that when it goes on the screen, or is part of a document, it just speaks? It sings the findings, and the reader just gets it right there. So the ability to present your findings, whether verbally, in a presentation, or in a document. Those communication and presentation skills are equally as important as the technical skills. When you have a great insight and you’re presenting your results, imagine you’re driving on a mountain and there’s a sharp turn, and you can’t see what’s beyond the turn. Then you make that turn and suddenly you see a tremendous valley in front of you, and you feel this great sense of awe: I didn’t know that, right? So when you present your findings, and you have this great finding and you communicate it well, this is what people feel, because they were not expecting it.
They were not aware of it, and then this great sense of happiness that now I know. And I didn’t know this, now I know. And then it empowers them, it gives them ideas, what they can do with this knowledge, this new insight. It’s a great sense of joy. And you are able as a data scientist, you are able to share with your clients because you enabled it.

Tools for Data Science

Languages of Data Science

Data science requires programming.

  • Visual programming

  • Open source

  • Commercial software - leverage open source software

  • Cloud computing

  • Python, R, SQL (recommended)

  • Scala, Java, C++, Julia

It depends on what problems you need to solve.

Roles in Data Science

  • Business Analyst
  • Database Engineer
  • Data Analyst
  • Data Engineer
  • Data Scientist
  • Research Scientist
  • Software Engineer
  • Statistician
  • Product Manager
  • Project Manager

Python

  • People who already know how to program
  • People who want to learn to program
  • Used by over 80% of data professionals worldwide
  • Python is used heavily in data science, AI, and machine learning, web development, and IoT

It is a:

  • General purpose language
  • Large standard library

Python in Data Science

  1. Scientific computing libraries like Pandas, NumPy, SciPy, and Matplotlib
  2. For AI: PyTorch, TensorFlow, Keras, and Scikit-learn
  3. NLP: NLTK (Natural Language Toolkit)
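A minimal sketch of how the scientific computing libraries above fit together (the city names and prices are made-up illustration data):

```python
import numpy as np
import pandas as pd

# NumPy supplies fast arrays and matrices.
prices = np.array([250_000, 320_000, 410_000])

# Pandas builds labeled DataFrames on top of NumPy arrays.
df = pd.DataFrame({"city": ["Ames", "Boone", "Nevada"], "price": prices})

# Vectorized operations work column-wise, without explicit Python loops.
df["price_k"] = df["price"] / 1000

print(df["price_k"].mean())  # average price in thousands
```

Matplotlib could then plot `df["price_k"]` directly; the same DataFrame flows into Scikit-learn or SciPy for modeling.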

R

Learning up to three languages can increase your salary.

It is not open source like Python; rather, it is free software.

  • The Open Source Initiative (OSI) champions open source
    • business focused
    • open source software can be modified without sharing the modified source code depending on the open source license
  • the Free Software Foundation (FSF) defines free software
    • more focused on a set of values
    • Free software can always be run, studied, modified, and redistributed with or without changes

Easy to translate from math to code. It is popular in academia. It integrates well with other computer languages. And has stronger object-oriented programming facilities than most statistical computing languages.

SQL

  • SQL = Structured Query Language

How it works:
  • a non-procedural language
  • scope is limited to querying and managing data

  • it was developed at IBM


SQL is a combination of clauses, expressions, predicates, queries, and statements.


What makes SQL great

  • Knowing SQL will help you get jobs as a business or data analyst and is a must in data engineering and data science.
  • When performing operations with SQL the data is accessed directly (without any need to copy it beforehand). This can considerably speed up workflow executions.
  • SQL is the interpreter between you and the database
  • SQL is an ANSI standard, which means that if you learn SQL with one database, you will be able to easily apply your SQL knowledge to many other databases

SQL databases available:
  • MySQL
  • PostgreSQL
  • SQLite
  • Oracle
  • IBM Db2
  • MariaDB
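The pieces of SQL named above (clauses, expressions, predicates, statements) can be seen in a tiny example using SQLite, one of the databases listed, via Python's standard-library `sqlite3` module; the sales table is made up:

```python
import sqlite3

# In-memory SQLite database (SQLite is one of the engines listed above).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Statements create and populate the table.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# A query combining clauses (SELECT, WHERE, GROUP BY, ORDER BY),
# an expression (SUM(amount)), and a predicate (amount > 60).
cur.execute("SELECT region, SUM(amount) FROM sales "
            "WHERE amount > 60 GROUP BY region ORDER BY region")
result = cur.fetchall()
print(result)  # [('east', 100.0), ('west', 250.0)]
conn.close()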

Other Languages

  • Scala
    • provides support for FP
    • extension to Java
    • Scalable Language
    • Apache Spark: designed to be faster than Hadoop
  • Java
    • tried-and-true general-purpose OOP language
    • Hadoop - manages data processing and storage for big data applications running in clustered systems
  • C++
    • extension of C
    • develop programs that feed data to customers in real-time
    • TensorFlow
    • MongoDB - a NoSQL database for big data management
    • Caffe - a deep learning algorithm repository
  • Julia
    • designed at MIT in 2012
    • speedy development like Python or R while producing programs that run as fast as C or Fortran programs would
    • Julia DB
  • JavaScript
    • extends beyond the browser with Node.js and other server-side approaches
    • TensorFlow.js
    • R-js: makes linear algebra possible in TypeScript
  • PHP
  • Go
  • Ruby
  • Visual Basic

Categories of Data Science Tools


The ones with green labels can be done via cloud service.

  • Data Asset Management:
    • Data Management: process of persisting and retrieving data
    • Data Integration and Transformation: Extract, Transform, and Load - the process of retrieving data from remote data management systems
      • also, transforming data and loading it into a local data management system
    • Data Visualization: part of an initial data exploration process, as well as being part of a final deliverable
    • Model Building: Create a machine learning or deep learning model using an algorithm with a lot of data
    • Model Deployment: Make models available to third-party applications
    • Model Monitoring and Assessment: ensures continuous performance quality checks on the deployed models
  • Code Asset Management: uses versioning and other collaborative features to facilitate teamwork
  • Development Environments: IDEs, tools that help data scientists implement, execute, test, and deploy their work
    • Execution Environments: tools where data processing, model training, and deployment take place
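The Extract, Transform, and Load process described above can be sketched in a few lines of standard-library Python; the CSV text stands in for a remote data source, and the scores are made up:

```python
import csv
import io
import sqlite3

# Extract: read raw data (an inline CSV string standing in for a remote system).
raw = "name,score\nalice,91\nbob,  84\ncarol,77\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean whitespace and convert types.
cleaned = [(r["name"], int(r["score"].strip())) for r in rows]

# Load: persist into a local data management system (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", cleaned)
total = conn.execute("SELECT SUM(score) FROM scores").fetchone()[0]
print(total)  # 252
```

Real pipelines delegate each of these three steps to tools like Apache Airflow or Apache NiFi, but the extract/transform/load shape is the same.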

Open Source Tools for Data Science

In part one of this two-part series, we’ll cover data management, and open source data integration, transformation, and visualization tools. The most widely used open source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop File System or cloud file systems like Ceph. Finally, Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval. The task of data integration and transformation in the classic data warehousing world is called ETL, which stands for “extract, transform, and load.” These days, data scientists often propose the term “ELT” (extract, load, transform), stressing the fact that data is dumped somewhere and the data engineer or data scientist themself is responsible for transforming it. Another term for this process has now emerged: “data refinery and cleansing.” Here are the most widely used open source data integration and transformation tools: Apache Airflow, originally created by Airbnb; KubeFlow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache NiFi, which delivers a very nice visual editor; Apache SparkSQL, which enables you to use ANSI SQL and scales up to compute clusters of thousands of nodes; and Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi. We’ll now introduce the most widely used open source data visualization tools. We have to distinguish between programming libraries, where you need to use code, and tools that contain a user interface. The most popular libraries are covered in the next videos. A similar approach uses Hue, which can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch as the data provider.
Finally, Apache Superset is a data exploration and visualization web application. Model deployment is extremely important. Once you’ve created a machine learning model capable of predicting some key aspects of the future, you should make that model consumable by other developers and turn it into an API. Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is by using MLeap. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js. Model monitoring is another crucial step. Once you’ve deployed a machine learning model, you need to keep track of its prediction performance as new data arrives, so that outdated models can be retrained or replaced. Here are some examples of model monitoring tools: ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it’s not specifically made for this purpose. Model performance is not exclusively measured through accuracy; model bias against protected groups like gender or race is also important. The IBM AI Fairness 360 open source toolkit does exactly this: it detects and mitigates bias in machine learning models.
Machine learning models, especially neural-network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to fool the model with manipulated data or by manipulating the model itself. The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to adversarial attacks and help make the model more robust. Machine learning models are often considered to be a black box that applies some mysterious “magic.” The IBM AI Explainability 360 Toolkit makes the machine learning process more understandable by finding similar examples within a dataset that can be presented to a user for manual comparison. It can also explain a model by training a simpler machine learning model that shows how different input variables affect the final decision. Options for code asset management tools have been greatly simplified: for code asset management, also referred to as version management or version control, Git is now the standard. Multiple services have emerged to support Git, with the most prominent being GitHub, which provides hosting for software development version management. The runner-up is definitely GitLab, which has the advantage of being a fully open source platform that you can host and manage yourself. Another choice is Bitbucket. Data asset management, also known as data governance or data lineage, is another crucial part of enterprise-grade data science. Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project, ODPi Egeria, is managed through the Linux Foundation and is an open ecosystem. It offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. Finally, Kylo is an open source data lake management software platform that provides extensive support for a wide range of data asset management tasks.
This concludes part one of this two-part series. Now let’s move on to part two.

In this section, we’ll cover development environments, execution environments, and fully integrated visual tools. One of the most popular current development environments that data scientists use is Jupyter. Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through “kernels.” Kernels shouldn’t be confused with operating system kernels; Jupyter kernels encapsulate the interactive interpreters for the different programming languages. A key property of Jupyter Notebooks is the ability to unify documentation, code, output from the code, shell commands, and visualizations in a single document. JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will actually replace them. The architectural changes being introduced in JupyterLab make Jupyter more modern and modular. From a user’s perspective, the main difference introduced by JupyterLab is the ability to open different types of files, including Jupyter Notebooks, data, and terminals, and then arrange these files on the canvas. Although Apache Zeppelin has been fully reimplemented, it’s inspired by Jupyter Notebooks and provides a similar experience. One key differentiator is the integrated plotting capability: in Jupyter Notebooks you are required to use external libraries, while in Apache Zeppelin plotting doesn’t require coding. You can also extend these capabilities with additional libraries. RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It exclusively runs R and all associated R libraries; however, Python development is possible, and R is tightly integrated into this tool to provide an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization in a single tool.
Spyder tries to mimic the behaviour of RStudio to bring its functionality to the Python world. Although Spyder does not have the same level of functionality as RStudio, data scientists do consider it an alternative, though in the Python world Jupyter is used more frequently. Like RStudio, Spyder integrates code, documentation, visualizations, and other components into a single canvas. Sometimes your data doesn’t fit into a single computer’s storage or main memory capacity. That’s where cluster execution environments come in. The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability: if you double the number of servers in a cluster, you’ll also roughly double its performance. After Apache Spark began to gain market share, Apache Flink was created. The key difference between the two is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file, whereas Apache Flink is a stream processing engine, with its main focus on processing real-time data streams. Although each engine supports both data processing paradigms, Apache Spark is usually the choice in most use cases. One of the latest developments in data science execution environments is “Ray,” which has a clear focus on large-scale deep learning model training. Let’s look at open source tools for data scientists that are fully integrated and visual. With these tools, no programming knowledge is necessary. The most important tasks are supported by these tools, including data integration, transformation, data visualization, and model building. KNIME originated at the University of Konstanz in 2004. KNIME has a visual user interface with drag-and-drop capabilities, as well as built-in visualization capabilities.
KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It’s less flexible than KNIME, but easier to use. In this video, you’ve learned about the most common data science tasks and which open source tools are relevant to those tasks. In the next video, we’ll describe some established commercial tools that you’ll encounter in your data science experience.

  • Data Asset Management:
    • Data Management
      • Relational databases: MySQL and PostgreSQL
      • NoSQL databases: MongoDB, Apache CouchDB, and Apache Cassandra
      • File-based tools: Hadoop File System or Cloud File systems like Ceph
      • Storing text data and creating a search index for fast document retrieval: Elasticsearch
    • Data Integration and Transformation (ETL)
      • or “ELT” or “data refinery and cleansing”
      • Apache AirFlow
      • KubeFlow
      • Apache Kafka
      • Apache Nifi
      • Apache SparkSQL
      • NodeRED
    • Data Visualization
      • Hue: create visualization from SQL queries
      • Kibana
      • Apache Superset
    • Model Deployment
      • Apache PredictionIO
      • Seldon - supports every framework like TensorFlow, Apache SparkML, R, and scikit-learn
      • MLeap
      • TensorFlow
    • Model Monitoring and Assessment
      • ModelDB
      • Prometheus
      • IBM AI Fairness 360
      • IBM Adversarial Robustness 360 Toolbox
      • IBM AI Explainability 360 Toolkit
    • Data Asset Management
      • Apache Atlas
      • ODPi Egeria
      • Kylo
  • Code Asset Management:
    • Git
  • Development Environments:
    • Jupyter
    • Apache Zeppelin
    • RStudio
    • Spyder
    • When your data doesn’t fit into a single computer’s storage or main memory capacity → cluster execution environments
      • Apache Spark
        • linear scalability
        • a batch data processing engine, capable of processing huge amounts of data file by file
      • Apache Flink
        • stream processing engine
      • Ray
        • clear focus on large-scale deep learning model training
  • Fully Integrated Visual Tools (data integration, transformation, visualization, model building)
    • KNIME
    • Orange

Commercial Tools for Data Science

  • Data Management:
    • Oracle Database
    • Microsoft SQL Server
    • IBM Db2

When we focus on commercial data integration tools, we’re talking about “ETL” tools. Vendors in the Gartner Magic Quadrant include:

  • Informatica PowerCenter
  • IBM InfoSphere DataStage
  • SAP
  • Oracle
  • SAS
  • Talend
  • Microsoft
  • Watson Studio Desktop

In commercial environments, data visualization is typically done with business intelligence, or “BI”, tools.

  • Tableau
  • Microsoft Power BI
  • IBM Cognos Analytics

When asking “How can different columns in a table relate to each other?” - Watson Studio Desktop

Cloud Based Tools for Data Science

  • Fully Integrated Visual Tools and Platforms
    • Watson Studio + Watson OpenScale
    • Azure Machine Learning
    • H2O.ai Driverless AI

SaaS - Software as a Service - the cloud provider operates the tool for you in the cloud.

e.g. 

  • AWS DynamoDB - NoSQL Database that allows storage and retrieval of data in a key-value or a document store format
    • JSON
  • Cloudant - database as a service
    • based on the open source Apache CouchDB
  • Db2 (IBM)

When it comes to commercial data integration tools, we talk not only about “extract, transform, and load,” or “ETL” tools, but also about “extract, load, and transform,” or “ELT,” tools. This means the transformation steps are not done by a data integration team but are pushed towards the domain of the data scientist or data engineer.

  • Informatica Cloud Data Integration
  • IBM’s Data Refinery (part of IBM Watson Studio)

Libraries for Data Science

  • Python Libraries
    1. Scientific computing Libraries in Python

      Libraries can sometimes be called “frameworks”.

      • Pandas: Dataframe
        • built on NumPy
      • NumPy: Arrays & matrices
    2. Visualization Libraries in Python

      • Matplotlib: plots & graphs, most popular
      • Seaborn
        • based on Matplotlib
        • heat maps, time series, violin plots
    3. High-Level Machine Learning and Deep Learning Libraries (meaning that you don’t have to worry about the details, which also means that it is hard to improve)

      • Scikit-learn: for ML: regression, classification
      • Keras: Deep Learning Neural Networks
      • TensorFlow: Deep Learning: Production and Deployment
      • PyTorch: Deep Learning: used for experimentation
    4. Deep Learning Libraries in Python

  • Libraries Used in other languages
    • Apache Spark: process data in parallel

      • pandas
      • numpy
      • scikit-learn
    • Scala

      • Vegas
      • BigDL: for deep learning
    • R
      R has been the de facto standard for open source data science, but it is now being superseded by Python.

      • Ggplot2
      • Keras, TensorFlow
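As a minimal sketch of the visualization libraries listed above, here is a Matplotlib example (Seaborn and ggplot2 produce similar figures on top of their respective stacks); the monthly sales numbers are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# Made-up monthly sales data for a simple line plot.
months = [1, 2, 3, 4]
sales = [10, 14, 9, 20]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")  # plots & graphs, as the notes say
ax.set_xlabel("month")
ax.set_ylabel("sales")
ax.set_title("Monthly sales (illustration data)")
fig.savefig("sales.png")  # write the figure to a file
```

Seaborn would wrap the same Matplotlib machinery to produce heat maps, time series, and violin plots with less code.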

Application Programming Interfaces (API)

What is an API: lets two pieces of software talk to each other.


API Libraries:

  • TensorFlow

REST APIs:

  • enable you to communicate using the Internet, taking advantage of storage, greater data access, AI algorithms, and many other resources
  • RE = Representational
  • S = State
  • T = Transfer
  • your program = the client


An API lets two pieces of software talk to each other. For example, you have your program, you have some data, and you have other software components. You use the API to communicate with the other software components. You don’t have to know how the API works; you just need to know its inputs and outputs. Remember, the API only refers to the interface, or the part of the library that you see; the “library” refers to the whole thing. Consider the pandas library. Pandas is actually a set of software components, many of which are not even written in Python. You have some data and a set of software components, and you use the pandas API to process the data by communicating with the other software components. There can be a single software component at the back end, but a separate API for each language. Consider TensorFlow, written in C++. There are separate APIs in Python, JavaScript, C++, Java, and Go. The API is simply the interface. There are also multiple volunteer-developed APIs for TensorFlow, for example in Julia, MATLAB, R, and Scala, among others. REST APIs are another popular type of API. They enable you to communicate using the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, and many other resources. The RE stands for “Representational,” the S stands for “State,” and the T stands for “Transfer.” In REST APIs, your program is called the “client.” The API communicates with a web service that you call through the internet. A set of rules governs communication, input or request, and output or response. Here are some common API-related terms. You or your code can be thought of as a client. The web service is referred to as a resource. The client finds the service through an endpoint. The client sends a request to the resource, and the resource sends a response to the client. HTTP methods are a way of transmitting data over the internet. We tell REST APIs what to do by sending a request.
The request is usually communicated through an HTTP message. The HTTP message usually contains a JSON file, which contains instructions for the operation that we would like the service to perform. This operation is transmitted to the web service over the internet. The service performs the operation. Similarly, the web service returns a response through an HTTP message, where the information is usually returned using a JSON file. This information is transmitted back to the client. The Watson Speech to Text API is an example of a REST API. This API converts speech to text. In the API call, you send a copy of the audio file to the API; this process is called a post request. The API then sends the text transcription of what the individual is saying. The API is making a get request. The Watson Language-Translator API provides another example. You send the text you would like to translate into the API, the API translates the text and sends the translation back to you. In this case we translate English to Spanish. In this video, we’ve discussed what an API is, API Libraries, REST APIs, including Request and Response.
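As a minimal sketch of the request/response pattern described above: the client serializes a JSON request body, the web service would return a JSON response, and the client parses it. The endpoint is omitted and the field names here are illustrative assumptions, not the real Watson API schema.

```python
import json

# Hypothetical JSON request body a client might POST to a translation service
# (field names are made up for illustration).
request_body = json.dumps({"text": "Hello", "target": "es"})

# Hypothetical JSON response body the web service might send back.
response_body = '{"translation": "Hola"}'

# The client parses the HTTP message body to get the result.
result = json.loads(response_body)
print(result["translation"])  # → Hola
```

In a real call, `request_body` would travel inside an HTTP message to the service’s endpoint, and `response_body` would arrive the same way.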

Open Data Sources

Community Data License Agreement
  • CDLA-Sharing: permission to use and modify the data; publication only under the same terms
  • CDLA-Permissive: permission to use and modify the data; no obligations

Machine Learning Models

  • Model
    • Data can contain a wealth of information
    • ML uses algorithms (models) to identify patterns in data (model training)
    • A model must be trained on data before it can be used to make predictions
      • learn from past data
    • types
      • supervised
      • unsupervised
      • reinforcement
  • Supervised
    • data is labeled and model trained to make correct predictions
    • regression
      • predict real numerical values
      • home sales prices, stock market prices
    • and classification problems
      • classify things into categories
      • email spam filters, fraud detection, image classification
  • Unsupervised
    • Data is not labeled
    • model tries to identify patterns without external help
    • clustering
      • purchase recommendation
    • Anomaly detection
      • identifies outliers in a data set, such as fraudulent credit card transactions or suspicious online log-in attempts
  • Reinforcement
    • conceptually similar to human learning processes
    • learns from rewards (successful outcomes)
    • Go, chess
  • Deep Learning
    • Emulate how the human brain works
    • Applications
      • NLP
      • Image, audio, and video analysis
      • Time series forecasting
      • etc
    • typically requires very large datasets of labeled data and is compute intensive
    • Models
      • build from scratch or download from public model repositories
      • Built using frameworks
      • popular model repositories
        • most frameworks provide a “model zoo”
        • ONNX model zoo
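The supervised workflow outlined above can be sketched in a few lines. This is a toy illustration with made-up data, using scikit-learn as an assumed framework: a model is trained (fit) on labeled examples, then used to predict labels for unseen cases.

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: by construction, the first feature alone
# determines the class label.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 0, 1, 1]

# Train the model on past data, then predict unseen cases.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.predict([[0, 0.5], [1, 0.5]]))  # → [0 1]
```

The same fit/predict pattern applies to the regression and clustering models listed above; only the estimator class changes.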


Jupyter Notebook and JupyterLab

Jupyter Architecture

Image

basic architectural design of the Jupyter ecosystem. Jupyter implements a two-process model, with a kernel and a client. The client is the interface that lets the user send code to the kernel; the kernel executes the code and returns the result to the client for display. When you use a Jupyter notebook, the client is the browser. Jupyter notebooks store your code, metadata, contents, and outputs. When saved, a notebook uses a .ipynb extension and a JSON structure. When you, the user, save it, it is sent from your browser to the notebook server, which saves the notebook file on disk as a JSON file with a .ipynb extension. The notebook server is responsible for saving and loading notebooks; the kernel is sent the cells of code when the user runs them. Jupyter also has an architecture for converting files to other formats, using a tool called nbconvert. For example, converting a notebook file into an HTML file goes through the following steps: a preprocessor modifies the notebook, an exporter converts the notebook to the new file format, and a postprocessor works on the file produced by the exporter. After conversion, when you request the URL of the HTML file, Jupyter first fetches the notebook, converts it to HTML, and displays the resulting HTML file to you. You should now be familiar with: the two-process model implementation of Jupyter, how notebook servers communicate with kernels and clients, and the architectural design of how notebook files are converted to other files.
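To make the “JSON with a .ipynb extension” point concrete, here is a rough sketch of the structure the notebook server writes to disk. The fields are abbreviated and the cell contents are made up; a real file carries more metadata as defined by the nbformat specification.

```python
import json

# Skeleton of a version-4 notebook as the server might save it (abridged).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "source": "print('hello')",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
        }
    ],
}

# The server serializes this structure to JSON in the .ipynb file.
text = json.dumps(notebook, indent=1)
print(len(json.loads(text)["cells"]))  # → 1
```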

Jupyter Notebooks on the Internet

There are thousands of interesting Jupyter notebooks available on the internet for you to learn from. One of the best sources is: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

Note that you can download such notebooks to your local computer or import them into a cloud-based notebook tool so that you can rerun them, modify them, and follow along with what’s explained in the notebook.

Very often Jupyter notebooks are shared in an already-rendered view. This means that you can look at them as if they were running locally on your machine. But sometimes folks only share a link to the Jupyter file (which you can recognize by the *.ipynb extension). In this case you can just grab the URL to that file and paste it into nbviewer => https://nbviewer.jupyter.org/

The list above gives you a very nice start with a huge collection of materials to explore, so it may be more useful here to give you some pointers to particularly interesting notebooks. As we have covered some toy examples with toy data in the labs, let me point to some work that uses these data and goes further down the road of data science. In addition, as we’ve covered the different tasks in data science, we’ll also provide an exemplar notebook for each of them.

First you start with exploratory data analysis, so this notebook is highly recommended to have a look at: https://nbviewer.jupyter.org/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb

For data integration / cleansing at a smaller scale, the python library pandas is often used. Please have a look at this notebook: https://towardsdatascience.com/data-cleaning-with-python-using-pandas-library-c6f4a68ea8eb

If you want to already experience what clustering is, have a look at this: https://nbviewer.jupyter.org/github/temporaer/tutorial_ml_gkbionics/blob/master/2%20-%20KMeans.ipynb

And finally, if you want to go for a more in-depth notebook on the iris dataset have a look here: https://www.kaggle.com/lalitharajesh/iris-dataset-exploratory-data-analysis

GitHub

Git and GitHub are popular environments among developers and data scientists for performing version control of source code files and projects, and for collaborating with others. You can’t talk about Git and GitHub without a basic understanding of what version control is.

A version control system allows you to keep track of changes to your documents. This makes it easy for you to recover older versions of your document if you make a mistake, and it makes collaboration with others much easier. Here is an example to illustrate how version control works. Let’s say you’ve got a shopping list and you want your roommates to confirm the things you need and add additional items. Without version control, you’ve got a big mess to clean up before you can go shopping. With version control, you know EXACTLY what you need after everyone has contributed their ideas.

Git is free and open source software distributed under the GNU General Public License. Git is a distributed version control system, which means that users anywhere in the world can have a copy of your project on their own computer; when they’ve made changes, they can sync their version to a remote server to share it with you. Git isn’t the only version control system out there, but the distributed aspect is one of the main reasons it’s become one of the most common version control systems available. Version control systems are widely used for things involving code, but you can also version control images, documents, and any number of file types. You can use Git without a web interface by using your command line interface, but GitHub is one of the most popular web-hosted services for Git repositories. Others include GitLab, Bitbucket, and Beanstalk. There are a few basic terms that you will need to know before you can get started. The SSH protocol is a method for secure remote login from one computer to another. A repository contains your project folders that are set up for version control. A fork is a copy of a repository. A pull request is the way you request that someone reviews and approves your changes before they become final. A working directory contains the files and subdirectories on your computer that are associated with a Git repository. There are a few basic Git commands that you will always use. When starting out with a new repository, you only need to create it once: either locally with the command “git init” and then pushing it to GitHub, or by cloning an existing repository with “git clone”.

“git add” moves changes from the working directory to the staging area. “git status” allows you to see the state of your working directory and the staged snapshot of your changes. “git commit” takes your staged snapshot of changes and commits them to the project. “git reset” undoes changes that you’ve made to the files in your working directory. “git log” enables you to browse previous changes to a project. “git branch” lets you create an isolated environment within your repository to make changes. “git checkout” lets you see and change existing branches. “git merge” lets you put everything back together again. To learn how to use Git effectively and begin collaborating with data scientists around the world, you will need to learn the essential commands. Luckily for us, GitHub has amazing resources available to help you get started. Go to try.github.io to download the cheat sheets and run through the tutorials. In the following modules, we’ll give you a crash course on setting up your local environment and getting started on a project.
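A minimal command-line run of the basic commands above. This is a sketch that assumes git is installed; the repository name, identity, and file are made up.

```shell
# Create a repository, stage a change, inspect it, and commit it.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "you@example.com"   # identity required for commits
git config user.name "Your Name"

echo "eggs" > shopping-list.txt
git add shopping-list.txt     # move the change to the staging area
git status --short            # shows the staged snapshot
git commit -q -m "Add shopping list"
git log --oneline             # browse previous changes to the project
```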

git is a distributed version control system - meaning that users anywhere in the world can have a full copy of your repositories on their own computers

Glossary

  • SSH protocol - a method for secure remote login from one computer to another
  • Repository - The folders of your project that are set up for version control
  • Fork - A copy of a repository
  • Pull request - the process you use to request that someone reviews and approves your changes before they become final
  • Working Directory - a directory on your file system, including its files and subdirectories, that is associated with a git repository

Command

  • git log
  • git reset: Reset current HEAD to the specified state
  • git checkout

Branch

how to create and merge a branch using the GitHub web interface. A branch is a snapshot of your repository to which you can make changes. It is a copy of the master branch and can be used to develop and test changes to the workflow before merging them back to the master branch. In Git and GitHub there is a main branch, called Master, which holds the deployable code and the official working version of your project. It is meant to be stable, and it is always advisable never to push untested code to master. Many times we want to make changes to the code and workflow in the master branch. That is when we create a copy of the master branch; let’s call it the child branch. We then copy the workflow to the child branch, where changes and experiments are done. We build and make edits, test the changes, and when we are satisfied with them we merge the child branch back into the master branch, where we prepare the model for deployment. All of this is done outside of the main branch, and until we merge, no changes are made to the workflow as it was before we branched. To ensure that changes made by one member do not impede or affect the work of other members, multiple branches can be created and merged into master after the workflow is properly tested and approved. To create branches in GitHub, let’s look at this repository. There is currently one branch in the repository. I want to make some changes, but I don’t want to alter the master in case something goes wrong, so we will create a branch. To do that, we click the drop-down arrow and create a new branch; let’s name it child branch and press Enter. The repository now has two branches, the master and the child branch. You can check this by selecting the child branch in the branch selector drop-down list. Whatever was in the master branch was copied to the child branch.
But we can add files in the child branch without adding any files to the master branch. To add a file, make sure the child branch is selected in the branch selector drop-down list. Click on Create new file. In the space provided, name the file; we will name it testchild.py and then add a few lines of code that print the statement “Inside child branch”. At the bottom of the screen, we will see a section called Commit new file. Commit messages are very important, as they help keep track of the changes that were made; it is important to add a descriptive commit message so that other team members can understand it. Here we will add the commit message “Create testchild.py”, then commit the new file. The file gets added only to the child branch. We can check this by going to the master branch, by clicking ‘master’ from the branch selector menu, and here we can see that the new file is not added to the master branch. After we have created the new file, tested it, and made sure it is up to standard, we want to merge the changes in the child branch so they are reflected in the master branch. To merge the changes, we first have to create a pull request, also known as a PR. A pull request, in simple terms, is a way to notify other team members of your changes and edits and ask them for review so the changes can be pulled, or merged, into the master branch. Pull requests are the heart of collaboration on GitHub. When you open a pull request, you are proposing your changes and requesting that someone review, pull in, and merge your contribution into the target branch. Pull requests show the differences in content between both branches. To open a pull request and see the differences between the branches, click on the Compare & pull request button. If you scroll down to the bottom of the screen, you will see something like this, showing the difference between both branches.
As you can see on the screen, it shows that one file has changed; the file has two additions, which are the two lines we added, and zero deletions. We will now create the pull request. Add the title and an optional comment for the pull request, then click Create pull request. You can assign team members to review and approve pull requests. On the next page you will see this image. If you are okay with the changes, click Merge pull request and then Confirm. You will get a confirmation that the pull request has been successfully merged. You can now delete the branch if you no longer need to make edits or add new information. Now the child branch has been completely merged into the master branch. If you check the master branch, you can see that it now contains the testchild.py file. You should now be familiar with how to create and merge branches using the web interface.
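For comparison, here is a sketch of the same branch-and-merge flow from the command line. It assumes git is installed; GitHub’s pull-request review step has no direct CLI equivalent, so it is replaced here by a plain local merge, and the file contents are made up.

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "you@example.com" && git config user.name "Demo"
echo "print('main')" > main.py
git add main.py && git commit -q -m "Initial commit"
base=$(git symbolic-ref --short HEAD)   # master or main, depending on git config

git checkout -q -b child-branch         # create and switch to the child branch
echo "print('Inside child branch')" > testchild.py
git add testchild.py && git commit -q -m "Create testchild.py"

git checkout -q "$base"                 # back to the main branch
git merge -q child-branch               # merge the tested changes
git branch -d child-branch              # delete the branch once merged
ls                                      # main.py  testchild.py
```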

IBM Watson Studio

Every business wants to work smarter, and to do that you need to tap into your company’s greatest resource: your data. But extracting the full value out of your data isn’t always an easy process. First, you end up juggling an incredibly large and complex collection of tools that are used for finding and cleaning data, analyzing and generating visualizations of that data, and using the data to build and deploy machine learning models. To make matters worse, these tools are often a time drain to manage individually and can be difficult to integrate into your system, which can really slow down the workflow. But not anymore. Using Watson Studio you can simplify your data projects with a streamlined process that allows you to extract value and insights from your data to help your business get smarter, faster. It delivers an easy-to-use, collaborative data science and machine learning environment for building and training models, preparing and analyzing data, and sharing insights, all in one place. Watson Studio’s easy-to-create visualizations and drag-and-drop code put the power of data-driven decision-making into the hands of any member of your organization, with no need for IT assistance. And if you need access to open source tools, the environment offers some of the most popular and powerful ones available. Watson Studio’s single environment also creates a workflow that’s incredibly efficient, so data scientists can share assets and work to solve problems within the system rather than starting from scratch every time a new issue arises. And developers can use that efficiency to quickly dive into building machine learning and deep learning algorithms. In fact, in the area of deep learning, Watson Studio supports some of the most popular frameworks and can deploy that deep learning onto the latest GPUs to help accelerate modeling by making it easier to use.
The environment’s built-in neural network modeler also helps you build models with a simplified graphical interface. Even if you don’t have the dedicated resources to build a model from scratch, Watson Studio can help you get started with modeling templates for areas such as visual recognition, language classification, and other tools from IBM Watson services. Because Watson Studio is seamlessly integrated with the IBM Watson Knowledge Catalog, an intelligent asset discovery tool, you can transform data and models into trusted enterprise resources and collaborate with confidence, without compromising compliance, security, or access control. Watson Studio provides many benefits for organizations, helping to infuse AI into the business and drive innovation. You can train Watson Studio with embedded AI services, including Watson Visual Recognition. You can customize your models and deploy them as APIs or Core ML by using open source tools like Jupyter Notebook, Anaconda, and RStudio. Watson Studio supports most popular code libraries, as well as no-code visual modeling with the neural network modeler for designing neural architectures using the most popular deep learning frameworks. In Watson Studio you can interactively discover, cleanse, and transform your data using Data Refinery. It helps you understand the quality and distribution of your data with built-in charts and statistics, and provides visualized results through interactive dashboards. Watson Studio includes an intuitive drag-and-drop interface that enables a non-programmer to speed up the model-building process by visually selecting, configuring, designing, and auto-coding neural networks. From development and training to production and evaluation, Watson Studio tracks your models over time to ensure you have the best performance for any given task, using the best solutions across the entire lifecycle of your machine learning models.

Data Science Methodology

  1. From Problem to Approach and from Requirements to Collection
  2. From Understanding to Preparation and from Modeling to Evaluation
  3. From Deployment to Feedback

Image

A training set is a set of historical data in which the outcomes are already known. The training set acts like a gauge to determine if the model needs to be calibrated. In this stage, the data scientist will play around with different algorithms to ensure that the variables in play are actually required. The success of data compilation, preparation and modelling, depends on the understanding of the problem at hand, and the appropriate analytical approach being taken. The data supports the answering of the question, and like the quality of the ingredients in cooking, sets the stage for the outcome. Constant refinement, adjustments and tweaking are necessary within each step to ensure the outcome is one that is solid. In John Rollins’ descriptive Data Science Methodology, the framework is geared to do 3 things: First, understand the question at hand. Second, select an analytic approach or method to solve the problem, and third, obtain, understand, prepare, and model the data. The end goal is to move the data scientist to a point where a data model can be built to answer the question.

Python for Data Science, AI & Development

Machine Learning with Python

Python for Machine Learning

Python is a popular and powerful general-purpose programming language that has recently emerged as the preferred language among data scientists. You can write your machine-learning algorithms from scratch using Python, and it works very well; however, there are many modules and libraries already implemented in Python that can make your life much easier. We introduce these Python packages in this course and use them in the labs to give you better hands-on experience. The first package is NumPy, a math library for working with N-dimensional arrays in Python. It enables you to do computation efficiently and effectively; for working with arrays, dictionaries, functions, datatypes, and images, you need to know NumPy. SciPy is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more; it is a good library for scientific and high-performance computation. Matplotlib is a very popular plotting package that provides 2D plotting as well as 3D plotting. Basic knowledge of these three packages, which are built on top of Python, is a good asset for data scientists who want to work with real-world problems. If you’re not familiar with these packages, I recommend that you take the Data Analysis with Python course first; it covers most of the useful topics in these packages. The pandas library is a very high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for data importing, manipulation, and analysis; in particular, it offers data structures and operations for manipulating numerical tables and time series. SciKit Learn is a collection of algorithms and tools for machine learning, which is our focus here and which you’ll learn to use within this course.
As we’ll be using SciKit Learn quite a bit in the labs, let me explain more about it and show you why it is so popular among data scientists. SciKit Learn is a free machine learning library for the Python programming language. It has most of the classification, regression, and clustering algorithms, and it’s designed to work with the Python numerical and scientific libraries NumPy and SciPy. It also includes very good documentation. On top of that, implementing machine learning models with SciKit Learn takes just a few lines of Python code. Most of the tasks in a machine learning pipeline are already implemented in SciKit Learn, including pre-processing of data, feature selection, feature extraction, train/test splitting, defining algorithms, fitting models, tuning parameters, prediction, evaluation, and exporting the model. Let me show you an example of what SciKit Learn looks like when you use this library. You don’t have to understand the code for now; just see how easily you can build a model with a few lines of code. Basically, machine-learning algorithms benefit from standardization of the dataset; if there are outliers or fields on different scales in your dataset, you have to fix them. The pre-processing package of SciKit Learn provides several common utility functions and transformer classes to change raw feature vectors into a form suitable for modeling. You have to split your dataset into train and test sets to train your model and then test the model’s accuracy separately; SciKit Learn can split arrays or matrices into random train and test subsets for you in one line of code. Then you can set up your algorithm; for example, you can build a classifier using a support vector classification algorithm. We call our estimator instance clf and initialize its parameters.
Now you can train your model on the train set: by passing our training set to the fit method, the clf model learns to classify unknown cases. Then we can use our test set to run predictions, and the result tells us the class of each unknown value. You can also use different metrics to evaluate your model’s accuracy, for example a confusion matrix to show the results. And finally, you save your model. You may find some of these machine-learning terms confusing, but don’t worry, we’ll talk about all of these topics in the following videos. The most important point to remember is that an entire machine learning task can be done in just a few lines of code using SciKit Learn. Please notice that, though it is possible, it would not be that easy to do all of this using the NumPy or SciPy packages alone, and it would need much more coding if you used pure Python to implement all of these tasks. Thanks for watching.

Difference between AI, ML, and DL

  • AI components:
    • computer vision
    • Language processing
    • creativity
    • Etc.
  • Machine Learning: statistical side of AI
    • Classification
    • clustering
    • neural network
    • etc.
  • Deep Learning: Deeper level, learning on its own

Python for Machine Learning

  • NumPy
  • SciPy
  • matplotlib
  • pandas
  • scikit-learn
    • free software machine learning library
    • classification, regression, and clustering algorithms
    • works with NumPy and SciPy
    • Great doc.
    • Easy to implement

Image

from sklearn import preprocessing 
X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

from sklearn import svm
clf = svm.SVC(gamma=.001, C=100.)
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, yhat, labels=[1, 0]))

import pickle
s = pickle.dumps(clf) # save model

Supervised vs Unsupervised

Image

Regression

Regression algorithms

  • Ordinal Regression
  • Poisson Regression
  • Fast forest quantile regression
  • Linear, Polynomial, Lasso, Stepwise, Ridge regression
  • Bayesian linear regression
  • Neural network regression
  • decision forest regression
  • boosted decision tree regression
  • KNN (K-nearest neighbors)

Training Accuracy

  • High training accuracy isn’t necessarily a good thing
  • May be the result of over-fitting
  • Over-fit: the model is overly trained to the dataset, which may capture noise and produce a non-generalized model

Out-of-Sample Accuracy

  • It’s important that our models have a high out-of-sample accuracy
  • Improve it using a train/test split

MAE \[ MAE = \frac{1}{n}\Sigma_{j=1}^{n} \vert y_j - \hat y_j\vert \]

MSE \[ MSE = \frac{1}{n}\Sigma_{j=1}^{n} ( y_j - \hat y_j)^2 \]

MSE is more popular than MAE because the squared term focuses the metric more on larger errors.

RMSE

\[ RMSE = \sqrt{\frac{1}{n}\Sigma_{j=1}^{n} ( y_j - \hat y_j)^2} \]

  • Interpretable as the same unit

RAE (Relative Absolute Error) normalizes the total absolute error by the total absolute error of a simple predictor that always predicts the mean of the actual values.

\[ RAE = \frac{\Sigma_{j=1}^{n} \vert y_j - \hat y_j\vert}{\Sigma_{j=1}^{n} \vert y_j - \bar y\vert} \]

RSE

\[ RSE = \frac{\Sigma_{j=1}^{n} (y_j - \hat y_j)^2}{\Sigma_{j=1}^{n} ( y_j - \bar y)^2} \]

  • Used to calculate \(R^2\)
    • \(R^2 = 1 - RSE\)
    • The higher the \(R^2\), the better the model fits your data
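The metrics above can be computed directly with NumPy. A small sketch with made-up values (not taken from the labs):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])       # actual values
y_hat = np.array([2.5, 5.0, 8.0])   # predicted values

mae = np.mean(np.abs(y - y_hat))                      # MAE
mse = np.mean((y - y_hat) ** 2)                       # MSE
rmse = np.sqrt(mse)                                   # RMSE, same unit as y
rae = np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y.mean()))
rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2 = 1 - rse                                          # R^2 = 1 - RSE
print(mae, r2)  # → 0.5 0.84375
```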

Python Code Simple Linear Regression

See relation:

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Creating train and test dataset:

msk = np.random.rand(len(df)) < 0.8
print(msk)
train = cdf[msk]
test = cdf[~msk]

Modeling:

from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)

# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
coef = regr.coef_[0][0]
inter = regr.intercept_[0]
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-', color='orange')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.annotate('y = {0:.0f} x + {1:.0f}'.format(coef, inter), xy=(5, 200))
plt.show()

Image

Evaluation:

from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )

Python Code Multiple Linear Regression

Modeling:

from sklearn import linear_model
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print(f'Intercept: {regr.intercept_}')

Prediction:

y_hat= regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
print("Residual sum of squares: %.2f"
      % np.mean((y_hat - y) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.6f' % regr.score(x, y))

Python Code Non-Linear Regression

Create a train and a test dataset:

msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])


poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
train_x_poly

fit_transform takes our x values and outputs our data raised to powers from 0 to 2 (since we set the degree of our polynomial to 2).

From \(y = b + \theta_1 x + \theta_2 x^2\) to \(y = b + \theta_1 x_1 + \theta_2 x_2\), with \(x_1 = x\) and \(x_2 = x^2\), so the polynomial model can be fit as an ordinary linear regression.
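A quick check of what the transformation produces, on toy values: the output columns are \(x^0, x^1, x^2\).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two sample x values; degree=2 expands each into [1, x, x^2].
x = np.array([[2.0], [3.0]])
print(PolynomialFeatures(degree=2).fit_transform(x))
# → [[1. 2. 4.]
#    [1. 3. 9.]]
```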

clf = linear_model.LinearRegression()
train_y_ = clf.fit(train_x_poly, train_y)
# The coefficients
print ('Coefficients: ', clf.coef_)
print ('Intercept: ',clf.intercept_)
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
XX = np.arange(0.0, 10.0, 0.1)
yy = clf.intercept_[0]+ clf.coef_[0][1]*XX+ clf.coef_[0][2]*np.power(XX, 2)
plt.plot(XX, yy, '-r' )
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Image

Evaluation:

from sklearn.metrics import r2_score

test_x_poly = poly.fit_transform(test_x)
test_y_ = clf.predict(test_x_poly)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.6f" % r2_score(test_y,test_y_ ) )

Non Linear Regression Analysis

Check relation:

plt.figure(figsize=(8,5))
x_data, y_data = (df["Year"].values, df["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()

Image

Choose a logistic function that fits:

\[ \hat{Y} = \frac1{1+e^{-\beta_1(X-\beta_2)}}\]

X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))

plt.plot(X,Y) 
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

Image

Building the Model

def sigmoid(x, Beta_1, Beta_2):
     y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
     return y

Find the best parameters for our model:

# Lets normalize our data
from scipy.optimize import curve_fit
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)

popt, pcov = curve_fit(sigmoid, xdata, ydata)
#print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))

x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()

Image

Evaluation Metrics in Classification

Classification Algorithms

  • Decision trees

  • Naive Bayes

  • Linear Discriminant Analysis

  • k-Nearest Neighbor

  • Logistic Regression

  • Neural Networks

  • Support Vector Machines (SVM)

Evaluation metrics work by comparing the actual labels \(y\) with the predicted labels \(\hat y\).

Jaccard index

  • \(y\): Actual labels
  • \(\hat y\): Predicted labels

The Jaccard index measures the similarity between the actual and predicted label sets: \(J(y, \hat y) = \frac{|y \cap \hat y|}{|y \cup \hat y|}\). Higher accuracy means a higher Jaccard index.
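A minimal sketch with made-up binary labels; for the positive class, sklearn's jaccard_score reduces to TP / (TP + FP + FN):

```python
from sklearn.metrics import jaccard_score

# made-up labels for illustration
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

# TP = 2, FP = 0, FN = 1  ->  J = 2 / (2 + 0 + 1)
score = jaccard_score(y_true, y_pred)
print(score)  # 0.666...
```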

F1-Score

  • Precision = TP / (TP + FP)
  • Recall (True positive rate) = TP / (TP + FN)
  • F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

An F1-score closer to 1 indicates a better classifier.
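The three formulas above can be checked by hand with hypothetical confusion-matrix counts:

```python
# hypothetical counts, for illustration only
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)                          # 0.8
recall = TP / (TP + FN)                             # 2/3
f1 = 2 * precision * recall / (precision + recall)  # 8/11
print(round(f1, 3))  # 0.727
```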

Log Loss

Measures the performance of a classifier whose output is a probability value between 0 and 1.

A lower log loss indicates better accuracy.
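A sketch with made-up labels and probabilities, checking sklearn's log_loss against the definition \(-\frac1N\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]\):

```python
import numpy as np
from sklearn.metrics import log_loss

# made-up labels and predicted probabilities for illustration
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.3]

# log loss = -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean([yt * np.log(p) + (1 - yt) * np.log(1 - p)
                   for yt, p in zip(y_true, y_prob)])
print(np.isclose(log_loss(y_true, y_prob), manual))  # True
```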

K-Nearest Neighbors

Algorithms

  • A method for classifying cases based on their similarity to other cases
  • Cases that are near each other are said to be ‘neighbors’
  • Based on the observation that similar cases with the same class labels tend to be near each other
  • Can also be used for regression

Algorithm Steps

  1. Pick a value for K
    1. K = 1: may capture noise or outliers
    2. K = 20: may be overly generalized
    3. Choose K by testing the model’s accuracy on out-of-sample data Image
  2. Calculate the distance of unknown case from all cases
    1. for 3 variables: Image
  3. Select the K-observations in the training data that are “nearest” to the unknown data point
  4. Predict the response of the unknown data point using the most popular response value from the K-nearest neighbors
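The four steps above can be sketched from scratch on a made-up toy dataset (in practice we use sklearn's KNeighborsClassifier):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # step 2: Euclidean distance from the unknown case to all cases
    dist = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # step 3: indices of the k nearest training points
    nearest = np.argsort(dist)[:k]
    # step 4: most popular class among the neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# made-up 2-D points with two classes
X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X, y, np.array([1.5, 1.5]), k=3))  # 'A'
```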

Python Code for K-Nearest Neighbors

Imported libraries:

import numpy as np
import pandas as pd
import mgt2001

from sklearn import preprocessing

from matplotlib import pyplot as plt
import matplotlib.cm as cm
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import matplotlib.mlab as mlab
%matplotlib inline

plt.style.use('ggplot') # refined style

Normalize the Data

# Normalize Data first
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Training

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)

yhat = neigh.predict(X_test)

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

For other values of \(k\):

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc

Including plot:

plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

Image

Decision Trees

Introduction

Example: Image

  1. Choose an attribute from your dataset
  2. Calculate the significance of attribute in splitting of data
  3. Split data based on the value of the best attribute
  4. Go to step 1

Building Decision Trees

Choose the attribute that has:

  • More predictiveness
  • Less impurity
  • Lower entropy
    • a measure of randomness or uncertainty
    • the lower the entropy, the less uniform the distribution, and the purer the node
    • Image
    • the model/package will calculate it for us

After trying out every attribute in the dataset, how do we determine which attribute is the best?

Image

Answer: The tree with the higher information gain after splitting.

Information gain is the increase in certainty after splitting:

\[\text{Information Gain} = \text{(Entropy before split)} - \text{(Weighted entropy after split)}\]
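A sketch of the computation with hypothetical class counts (the node sizes and split below are made up for illustration):

```python
import numpy as np

def entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

# hypothetical node: 9 positive / 5 negative examples before the split
before = entropy([9/14, 5/14])        # ~0.940
# hypothetical split: left node (6+, 2-), right node (3+, 3-)
after = (8/14) * entropy([6/8, 2/8]) + (6/14) * entropy([3/6, 3/6])
gain = before - after
print(round(gain, 3))  # ~0.048
```

The attribute whose split yields the largest such gain is chosen at each step.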

Python Code

from sklearn.tree import DecisionTreeClassifier

Sklearn Decision Trees do not handle categorical variables, so we convert them to numerical values first:

from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 


le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])


le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]

Setting up the Decision Tree:

from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

Modeling

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
# drugTree # it shows the default parameters

drugTree.fit(X_trainset,y_trainset)

Making Prediction

predTree = drugTree.predict(X_testset)

print (predTree [0:5])
print (y_testset [0:5])

Evaluation

from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

Visualization

from  io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline 
dot_data = StringIO()
filename = "drugtree.png"
featureNames = my_data.columns[0:5]
targetNames = my_data["Drug"].unique().tolist()
out=tree.export_graphviz(drugTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_trainset), filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
plt.show()

Image

Python Code for Logistic Regression

Normalization


X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]

y = np.asarray(churn_df['churn'])
y[0:5]


from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

Train/Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Modeling (Logistic Regression)

  • C parameter indicates inverse of regularization strength which must be a positive float. Smaller values specify stronger regularization.
  • predict_proba returns estimates for all classes, ordered by the label of classes.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

yhat = LR.predict(X_test)
yhat

yhat_prob = LR.predict_proba(X_test)
yhat_prob
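A self-contained sketch (with made-up data) of how the columns of predict_proba line up with classes_:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up one-feature dataset for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_toy, y_toy)
print(lr.classes_)                    # column order of predict_proba
proba = lr.predict_proba([[1.5]])
print(proba)                          # [[P(class 0), P(class 1)]]
```

Each row of predict_proba sums to 1 across the classes.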

Evaluation

Jaccard Index

from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat,pos_label=0)

Confusion Matrix

from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0]) # result
# np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False,  title='Confusion matrix')

print (classification_report(y_test, yhat))

Log Loss

from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)

A lower log loss means better accuracy.

Trying different solvers:

# write your code here

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

lls = list()
solvers=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

for solver in solvers:
    LR = LogisticRegression(C=0.01, solver=solver).fit(X_train,y_train)
    yhat = LR.predict(X_test)
    yhat_prob = LR.predict_proba(X_test)
    print(yhat)
    ll = log_loss(y_test, yhat_prob)
    lls.append(ll)
    print(f'{solver}\'s Log Loss: {ll}')
    
print(f"{min(lls)} at {solvers[lls.index(min(lls))]}")

Support Vector Machine

Introduction

SVM is a supervised algorithm that classifies cases by finding a separator.

  1. Mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable
  2. Finding a separator

Basically, SVMs are based on the idea of finding the hyperplane that best divides a dataset into two classes. A reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes; so the goal is to choose a hyperplane with as big a margin as possible. The examples closest to the hyperplane are the support vectors. Intuitively, only the support vectors matter for achieving this goal, so the other training examples can be ignored. We try to find the hyperplane that has the maximum distance to the support vectors.

SVM is good for image analysis tasks, such as image classification and handwritten digit recognition. SVM is also very effective in text-mining tasks, particularly because it deals well with high-dimensional data; for example, it is used for spam detection, text category assignment, and sentiment analysis. Another application of SVM is gene expression data classification, again because of its power in high-dimensional data classification. SVM can also be used for other types of machine learning problems, such as regression, outlier detection, and clustering.

Python Code

Split

feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)

cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Modeling

Four common kernel types:

  1. Linear
  2. Polynomial
  3. Radial Basis Function (RBF)
  4. Sigmoid
['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

yhat = clf.predict(X_test)
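The remaining kernels can be compared in a loop, much like the solver loop for logistic regression above. A self-contained sketch on sklearn's built-in breast-cancer dataset (not the cell_df used in this section):

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

# standardize features so every kernel converges quickly
Xk, yk = load_breast_cancer(return_X_y=True)
Xk = StandardScaler().fit_transform(Xk)
Xk_tr, Xk_te, yk_tr, yk_te = train_test_split(Xk, yk, test_size=0.2, random_state=4)

scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    k_clf = svm.SVC(kernel=kernel).fit(Xk_tr, yk_tr)
    scores[kernel] = f1_score(yk_te, k_clf.predict(Xk_te), average='weighted')
    print(f"{kernel}: {scores[kernel]:.3f}")
```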

Evaluation

from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Confusion Matrix

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)

print (classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False,  title='Confusion matrix')

              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137

Image

F1-Score

from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted') 

Jaccard

from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat,pos_label=2)

Clustering